Multilingual speech recognition has attracted considerable attention as an effective way to compensate for data scarcity in low-resource languages. End-to-end (E2E) modeling is preferred over conventional hybrid systems, mainly because no lexicon is required. However, in limited-data scenarios, hybrid DNN-HMM systems still outperform E2E models. Moreover, the problem of manual lexicon creation has been alleviated by publicly available trained grapheme-to-phoneme (G2P) models and IPA transliterations for many languages. In this paper, a novel approach to hybrid DNN-HMM acoustic modeling is proposed in a multilingual setup for low-resource languages. The posterior distributions produced by different monolingual acoustic models for a target-language speech signal are fused together. A separate regression neural network is trained for each source-target language pair to transform posteriors from the source acoustic model to the target language. Compared to ASR training, these networks require very limited data. Posterior fusion yields relative gains of 14.65% and 6.5% over the multilingual and monolingual baselines, respectively. Cross-lingual model fusion shows that comparable results can be achieved without using posteriors from a language-dependent ASR system.
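The fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-pair regression network is reduced to a single linear-plus-softmax layer, and the fusion to a weighted average; all names and dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def map_posteriors(src_post, W, b):
    """Map source-language posteriors to the target phone set with a
    regression network (here a single linear layer): one per language pair."""
    return softmax(src_post @ W + b)

def fuse_posteriors(mapped_list, weights=None):
    """Fuse mapped posteriors from several monolingual models by a
    (weighted) average over models, frame by frame."""
    stacked = np.stack(mapped_list)          # (n_models, frames, n_targets)
    if weights is None:
        weights = np.full(len(mapped_list), 1.0 / len(mapped_list))
    return np.tensordot(weights, stacked, axes=1)

# toy example: two source models, 3 frames, 4 source / 5 target classes
rng = np.random.default_rng(0)
posts = [softmax(rng.normal(size=(3, 4))) for _ in range(2)]
mapped = [map_posteriors(p, rng.normal(size=(4, 5)), np.zeros(5)) for p in posts]
fused = fuse_posteriors(mapped)
assert np.allclose(fused.sum(axis=-1), 1.0)  # still a distribution per frame
```

In practice the regression networks would be small multi-layer models trained on paired posteriors, which is why they need far less data than full ASR training.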
Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes) as there do not exist any datasets that facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning) and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities.
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures with the diffusion guidance process based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.
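To see why keys and values are the natural manipulation point, it helps to write out the cross-attention computation itself. The sketch below is a generic single-head cross-attention in numpy, not the paper's modified layer; shapes and names are illustrative.

```python
import numpy as np

def cross_attention(image_q, text_k, text_v):
    """Single-head cross-attention: image-latent queries attend over text
    keys/values. In diffusion T2I models, keys and values come from the
    text encoder, so editing them (e.g. restructuring them per prompt
    constituent) changes which prompt words shape which image regions."""
    d = image_q.shape[-1]
    scores = image_q @ text_k.T / np.sqrt(d)        # (pixels, tokens)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over tokens
    return attn @ text_v, attn

rng = np.random.default_rng(1)
q = rng.normal(size=(16, 8))    # 16 latent "pixels"
k = rng.normal(size=(5, 8))     # 5 prompt tokens
v = rng.normal(size=(5, 8))
out, attn = cross_attention(q, k, v)
assert attn.shape == (16, 5) and np.allclose(attn.sum(axis=-1), 1.0)
```

Because each spatial location's output is a convex combination of token values, replacing or re-weighting `text_k`/`text_v` per linguistic constituent directly steers which tokens influence which regions, without retraining.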
Applications such as employees sharing office spaces over a workweek can be modeled as problems where agents are matched to resources over multiple rounds. Agents' requirements limit the set of compatible resources and the rounds in which they want to be matched. Viewing such an application as a multi-round matching problem on a bipartite compatibility graph between agents and resources, we show that a solution (i.e., a set of matchings, with one matching per round) can be found efficiently if one exists. To cope with situations where a solution does not exist, we consider two extensions. In the first extension, a benefit function is defined for each agent and the objective is to find a multi-round matching to maximize the total benefit. For a general class of benefit functions satisfying certain properties (including diminishing returns), we show that this multi-round matching problem is efficiently solvable. This class includes utilitarian and Rawlsian welfare functions. For another benefit function, we show that the maximization problem is NP-hard. In the second extension, the objective is to generate advice to each agent (i.e., a subset of requirements to be relaxed) subject to a budget constraint so that the agent can be matched. We show that this budget-constrained advice generation problem is NP-hard. For this problem, we develop an integer linear programming formulation as well as a heuristic based on local search. We experimentally evaluate our algorithms on synthetic networks and apply them to two real-world situations: shared office spaces and matching courses to classrooms.
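Since resources are reusable across rounds, the basic feasibility question decomposes: a full solution exists iff, in every round, all agents requesting that round can be matched simultaneously. The sketch below makes that concrete with a standard augmenting-path bipartite matching per round; function and variable names are hypothetical, and this is the decomposition idea only, not the paper's algorithm.

```python
def max_matching(adj, n_right):
    """Augmenting-path (Kuhn's) maximum bipartite matching.
    adj[u] = list of right-side vertices compatible with left vertex u."""
    match_r = [-1] * n_right

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_r[v] == -1 or try_augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    matched = sum(try_augment(u, set()) for u in range(len(adj)))
    return matched, match_r

def multi_round_solution(agents, resources, compat, wants):
    """Solve each round independently; a full multi-round solution exists
    iff every agent is matched in every round it requests."""
    rounds = sorted({r for w in wants.values() for r in w})
    plan = {}
    for r in rounds:
        active = [a for a in agents if r in wants[a]]
        adj = [[resources.index(x) for x in compat[a]] for a in active]
        matched, match_r = max_matching(adj, len(resources))
        if matched < len(active):
            return None                      # this round cannot be satisfied
        plan[r] = {active[u]: resources[i]
                   for i, u in enumerate(match_r) if u != -1}
    return plan

# toy instance: two agents sharing two desks over two rounds
agents = ["a1", "a2"]
resources = ["desk1", "desk2"]
compat = {"a1": ["desk1"], "a2": ["desk1", "desk2"]}
wants = {"a1": {1, 2}, "a2": {1}}
plan = multi_round_solution(agents, resources, compat, wants)
assert plan[1]["a1"] == "desk1" and plan[1]["a2"] == "desk2"
```

The benefit-maximizing and advice-generation extensions in the abstract build on top of this feasibility core, where the per-round independence no longer suffices.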
Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel \underline{\textbf{C}}ounterfactual \underline{\textbf{P}}rompt \underline{\textbf{L}}earning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. Particularly, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes a concept change, and learns more generalizable prompt representations from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks than previous prompt tuning methods on CLIP. On image classification, we achieve 3.55\% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09\% and 25.08\% relative improvements across three few-shot scenarios on unseen test sets respectively.
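The contrastive side of the framework can be illustrated with a generic InfoNCE-style loss over factual and counterfactual features. This is a stand-in sketch, not the exact CPL objective; all names are assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive (InfoNCE-style) loss: pull the factual pair together
    and push counterfactual/negative features away in embedding space."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                    # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(2)
anchor = rng.normal(size=16)
# a well-aligned factual pair gives a lower loss than a misaligned one
loss_close = info_nce(anchor, anchor + 0.01 * rng.normal(size=16),
                      [rng.normal(size=16) for _ in range(4)])
loss_far = info_nce(anchor, -anchor, [rng.normal(size=16) for _ in range(4)])
assert loss_close < loss_far
```

In CPL the "negatives" would be counterfactual features built from minimal concept-changing edits, so the loss explicitly penalizes prompts that rely on spurious features shared by factual and counterfactual examples.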
Many situations where agents with restrictions compete for resources can be cast as maximum matching problems on bipartite graphs. Our focus is on resource allocation problems in which agents may have restrictions that make them incompatible with some resources. We assume that a principal chooses a maximum matching randomly, so that each agent is matched to a resource with some probability. Agents would like to improve their chances of being matched by modifying their restrictions within certain limits. The principal's goal is to advise an unsatisfied agent on which restrictions to relax so that the total cost of the relaxation is within a budget (chosen by the agent) and the probability of being assigned a resource is maximized. We establish hardness results for some variants of this budget-constrained maximization problem and present algorithmic results for other variants. We experimentally evaluate our methods on synthetic datasets as well as two novel real-world datasets: a vacation activities dataset and a classrooms dataset.
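Since the budget-constrained variant is hard, a brute-force sketch on a toy instance is a reasonable way to pin down the objective: enumerate all maximum matchings to get the agent's match probability under a uniformly random choice, then search over relaxation subsets within budget. Everything here (names, the relaxations-as-extra-edges encoding) is an illustrative assumption, viable only for tiny instances.

```python
from itertools import combinations

def all_max_matchings(edges):
    """Enumerate every maximum matching of a tiny bipartite graph
    by include/exclude recursion (exponential; illustration only)."""
    best, out = 0, []
    def rec(pool, used_a, used_r, cur):
        nonlocal best, out
        if not pool:
            if len(cur) > best:
                best, out = len(cur), [cur]
            elif len(cur) == best:
                out.append(cur)
            return
        (a, r), rest = pool[0], pool[1:]
        rec(rest, used_a, used_r, cur)                      # skip this edge
        if a not in used_a and r not in used_r:             # or take it
            rec(rest, used_a | {a}, used_r | {r}, cur + [(a, r)])
    rec(list(edges), set(), set(), [])
    return out

def match_probability(agent, edges):
    """Probability the agent is matched when the principal picks a
    maximum matching uniformly at random."""
    mms = all_max_matchings(edges)
    return sum(any(a == agent for a, _ in m) for m in mms) / len(mms)

def best_advice(agent, base_edges, relaxations, costs, budget):
    """Brute-force advice generation: choose a subset of relaxable
    restrictions within budget maximizing the agent's match probability."""
    best_p, best_set = match_probability(agent, base_edges), frozenset()
    names = list(relaxations)
    for k in range(1, len(names) + 1):
        for sub in combinations(names, k):
            if sum(costs[s] for s in sub) > budget:
                continue
            p = match_probability(agent, base_edges +
                                  [relaxations[s] for s in sub])
            if p > best_p:
                best_p, best_set = p, frozenset(sub)
    return best_p, best_set

base = [("a1", "r1"), ("a2", "r1")]
relax = {"allow_r2": ("a2", "r2")}      # relaxing a restriction adds an edge
p, chosen = best_advice("a2", base, relax, costs={"allow_r2": 1}, budget=1)
assert p == 1.0 and chosen == frozenset({"allow_r2"})
```

In the toy instance, `a2` is matched with probability 1/2 before relaxing (two equally likely maximum matchings) and with certainty after, which is exactly the quantity the ILP and local-search heuristic in the abstract optimize at scale.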
Nowadays, yoga is receiving worldwide attention because of the increased stress of modern lifestyles, and there are many ways and resources to learn it. The word yoga means a deep connection between mind and body. Today, there is substantial medical and scientific evidence that the fundamentals of our brain activity, and even our chemistry, can be changed by practicing different systems of yoga. Suryanamaskar, also known as "salute to the sun", is a yoga practice that combines eight different forms across 12 asanas (4 asanas are repeated), dedicated to the Indian sun deity Surya. Suryanamaskar offers many health benefits, such as strengthening muscles and helping control blood sugar levels. Here, the MediaPipe library is used to analyze Surya Namaskar poses. The software detects poses in real time as a person performs Surya Namaskar in front of a camera. A class separator identifies each form as one of the following: Pranamasana, Hasta Padasana, Hasta Uttanasana, Ashwa Sanchalanasana, Ashtanga Namaskara, Dandasana or Bhujangasana, and Svanasana. A deep-learning technique (CNN) is used to develop the model, which achieves a model accuracy of 98.68% and a precision score of 0.75 in detecting correct yoga (Surya Namaskar) postures. Using this method, users can practice the desired pose and check whether the pose they are performing is correct. It will help practitioners perform all the different poses of Surya Namaskar correctly and improve their efficiency. This paper describes the entire framework implemented in the model.
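A pipeline like this typically normalizes the MediaPipe pose landmarks before classification so that the descriptor does not depend on where the person stands or how far they are from the camera. The sketch below shows that normalization with a nearest-centroid stand-in for the paper's CNN class separator; only the landmark indices (33 points, hips at 23/24, shoulders at 11/12) follow MediaPipe's documented layout, everything else is an assumption.

```python
import numpy as np

def normalize_landmarks(lm):
    """Center on the hip midpoint and scale by torso length so the pose
    descriptor is invariant to image position and camera distance.
    lm: (33, 2) array of MediaPipe pose landmark (x, y) coordinates."""
    hips = (lm[23] + lm[24]) / 2          # MediaPipe left/right hip indices
    shoulders = (lm[11] + lm[12]) / 2     # left/right shoulder indices
    torso = np.linalg.norm(shoulders - hips) + 1e-8
    return ((lm - hips) / torso).ravel()

def classify(lm, centroids):
    """Nearest-centroid stand-in for the CNN class separator: assign the
    asana whose reference descriptor is closest to this frame's."""
    feats = normalize_landmarks(lm)
    dists = {name: np.linalg.norm(feats - c) for name, c in centroids.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(4)
pose = rng.normal(size=(33, 2))
centroids = {"pranamasana": normalize_landmarks(pose),
             "dandasana": normalize_landmarks(rng.normal(size=(33, 2)))}
# shifting the whole body in the frame does not change the prediction
assert classify(pose + 0.5, centroids) == "pranamasana"
```

In the actual system the normalized landmarks (or the raw frames) would feed the trained CNN, with one output per asana in the list above.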
In this paper, we focus on improving binary 2D instance segmentation to help humans label ground-truth datasets with polygons. Human labelers only need to draw boxes around objects, and polygons are then generated automatically. To be useful, our system has to run in real time on a CPU. The most common approaches to binary instance segmentation involve encoder-decoder networks. This report evaluates state-of-the-art encoder-decoder networks and proposes a method for improving instance segmentation quality with these networks. In addition to network architecture improvements, our proposed method relies on providing extra information to the network input, namely extreme points, i.e., the outermost points on the object contour. Users can label them almost as quickly as bounding boxes, and bounding boxes can also be derived from the extreme points. Compared to other state-of-the-art encoder-decoder networks, this method produces better IoU and is also fast enough when deployed on a CPU.
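A common way to feed point annotations to an encoder-decoder network is to render them as a Gaussian heatmap concatenated to the image as an extra channel; the same four points also determine the bounding box, so labelers need not draw it. The sketch below shows one plausible encoding (the heatmap construction, `sigma`, and names are assumptions, not the report's exact implementation).

```python
import numpy as np

def extreme_point_heatmap(h, w, points, sigma=2.0):
    """Render the four extreme points as a Gaussian heatmap to be
    concatenated to the RGB crop as an extra input channel."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for (px, py) in points:
        heat = np.maximum(
            heat, np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2)))
    return heat

def bbox_from_extreme_points(points):
    """The bounding box is implied by the extreme points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

pts = [(10, 2), (3, 15), (28, 14), (12, 29)]   # top, left, right, bottom
heat = extreme_point_heatmap(32, 32, pts)
rgb = np.zeros((32, 32, 3), dtype=np.float32)
net_input = np.concatenate([rgb, heat[..., None]], axis=-1)  # (32, 32, 4)
assert net_input.shape == (32, 32, 4)
assert bbox_from_extreme_points(pts) == (3, 2, 28, 29)
```

The extra channel gives the decoder explicit anchors on the object contour, which is where the IoU gains over box-only inputs come from.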
Real-world facial expression recognition (FER) datasets suffer from noisy annotations due to crowdsourcing, ambiguity of expressions, subjectivity of annotators, and inter-class similarity. Moreover, recent deep networks have a strong capacity to memorize noisy annotations, leading to corrupted feature embeddings and poor generalization. To handle noisy annotations, we propose a dynamic FER learning framework (DNFER), in which clean samples are selected based on dynamic class-specific thresholds during training. Specifically, DNFER combines supervised training on the selected clean samples with unsupervised training on all samples. During training, the mean posterior class probabilities of each mini-batch are used as dynamic class-specific thresholds to select clean samples for supervised training. This threshold is independent of the noise rate and, unlike other methods, does not require any clean data. In addition, to learn from all samples, the posterior distributions of weakly augmented and strongly augmented images are aligned using an unsupervised consistency loss. We demonstrate the robustness of DNFER on both synthetic and real noisy-annotated FER datasets such as RAF-DB, FERPlus, SFEW, and AffectNet.
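The clean-sample selection step can be sketched directly from the description above. This is one plausible reading of the abstract, not the authors' code: per mini-batch, the mean posterior probability of each class (over samples labeled with that class) acts as that class's threshold.

```python
import numpy as np

def select_clean(probs, labels):
    """DNFER-style selection: a sample is 'clean' if its posterior
    probability for its own label meets the dynamic class-specific
    threshold, i.e. the batch mean of that probability for its class."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    clean = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = labels == c
        thr = probs[idx, c].mean()          # dynamic class-specific threshold
        clean[idx] = probs[idx, c] >= thr
    return clean

# toy batch: 4 samples, 2 classes, all labeled class 0
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
labels = np.array([0, 0, 0, 0])
# mean p(class 0) = 0.55, so only the first two samples are kept
assert select_clean(probs, labels).tolist() == [True, True, False, False]
```

Because the threshold is recomputed from each batch's own posteriors, it adapts as the model improves and needs neither a known noise rate nor a verified clean subset.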
Automatic affect recognition has applications in many areas, such as education, gaming, software development, automotive, healthcare, etc. However, achieving considerable performance on in-the-wild datasets is a non-trivial task. In-the-wild datasets, though better representatives of real-world scenarios than synthetic datasets, suffer from the problem of incomplete labels. Inspired by semi-supervised learning, in this paper we present our submission to the Multi-Task Learning challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) 2022 competition. The three tasks considered in this challenge are valence-arousal (VA) estimation, classification of expressions into the 6 basic categories (anger, disgust, fear, happiness, sadness, surprise) plus neutral and "other", and detection of 12 action units, AU-{1, 2, 4, 6, 7, 10, 12, 15, 23, 24, 25, 26}. Our semi-supervised multi-task facial affect recognition method, titled SS-MFAR, uses a deep residual network with a task-specific classifier for each task, adaptive thresholds for each expression class, and semi-supervised learning. The source code is available at https://github.com/1980x/ABAW2022DMACS.
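The three task heads differ in output type, which a forward-pass sketch makes explicit: bounded regression for VA, a softmax over 8 expression classes, and 12 independent sigmoids for AUs. This is an illustrative stand-in with linear heads on an assumed 512-dimensional shared backbone feature, not the SS-MFAR architecture itself.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 512   # shared ResNet feature dimension (an assumption for this sketch)

# task-specific linear heads on top of the shared backbone feature
W_va = rng.normal(size=(D, 2))
W_expr = rng.normal(size=(D, 8))
W_au = rng.normal(size=(D, 12))

def forward(feat):
    """Multi-task heads: VA regression squashed to [-1, 1] with tanh,
    8-way expression softmax (6 basic + neutral + other), and 12
    independent AU sigmoids for multi-label detection."""
    va = np.tanh(feat @ W_va)
    logits = feat @ W_expr
    e = np.exp(logits - logits.max())
    expr = e / e.sum()
    au = 1 / (1 + np.exp(-(feat @ W_au)))
    return va, expr, au

va, expr, au = forward(rng.normal(size=D))
assert va.shape == (2,) and np.all(np.abs(va) <= 1)
assert expr.shape == (8,) and np.isclose(expr.sum(), 1.0)
assert au.shape == (12,) and np.all((au >= 0) & (au <= 1))
```

With incomplete labels, each head's loss would be computed only on the samples annotated for that task, with the adaptive per-class thresholds gating which pseudo-labeled expression samples contribute.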